A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments

Authors

  • Xite Wang
  • Derong Shen
  • Ge Yu
  • Tiezheng Nie
  • Yue Kou
Abstract

MapReduce is one of the most popular parallel data processing systems and is widely used in many fields. Task scheduling is one of its most important techniques, and the scheduling strategy directly affects system performance. However, in multi-user shared MapReduce environments, existing task scheduling algorithms cannot provide high system throughput when processing batch jobs. This paper therefore proposes a novel scheduling technique, the Throughput-Driven task scheduling algorithm (TD scheduler). First, based on the characteristics of shared MapReduce environments, we propose the framework of the TD scheduler. Second, we classify jobs into six states; jobs in different states have different scheduling priorities. We also give the state conversion rules, which ensure fair resource allocation and avoid wasting system resources. Third, we design detailed strategies for job selection and task assignment, which effectively improve the ratio of local task assignment and avoid hotspots. Fourth, we show that the TD scheduler can be applied to heterogeneous MapReduce clusters with small modifications. Finally, the performance of the TD scheduler is verified through extensive simulation experiments. The experimental results show that the proposed TD scheduler effectively improves system throughput for batch jobs in shared MapReduce environments.
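
The abstract does not spell out the TD scheduler's job states or assignment rules, so the following is only a generic sketch of the two goals it names for task assignment: a high ratio of local assignments and no hotspots. It is not the paper's algorithm. All names here (`Task`, `Node`, `assign_tasks`, `max_running_per_node`) are hypothetical, and the per-node cap is an assumed stand-in for whatever hotspot rule the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    replica_nodes: frozenset  # names of nodes that store this task's input block


@dataclass
class Node:
    name: str
    free_slots: int
    running: int = 0


def assign_tasks(pending, nodes, max_running_per_node):
    """Greedy locality-first assignment with a per-node load cap (illustrative only).

    Returns (assignments, local_ratio), where assignments maps task_id to node
    name and local_ratio is the fraction of assigned tasks placed on a node
    that holds their input.
    """
    by_name = {n.name: n for n in nodes}
    assignments = {}
    local_count = 0

    def usable(node):
        # A node is usable if it has a free slot and is below the assumed cap.
        return node.free_slots > 0 and node.running < max_running_per_node

    def place(task, node):
        node.free_slots -= 1
        node.running += 1
        assignments[task.task_id] = node.name

    # Pass 1: place each task on the least-loaded usable node holding its input.
    deferred = []
    for task in pending:
        holders = [by_name[r] for r in sorted(task.replica_nodes)
                   if r in by_name and usable(by_name[r])]
        if holders:
            place(task, min(holders, key=lambda n: n.running))
            local_count += 1
        else:
            deferred.append(task)

    # Pass 2: place the remaining tasks remotely on the least-loaded usable node.
    for task in deferred:
        candidates = [n for n in nodes if usable(n)]
        if not candidates:
            break  # cluster is full; the rest stay pending
        place(task, min(candidates, key=lambda n: n.running))

    local_ratio = local_count / len(assignments) if assignments else 0.0
    return assignments, local_ratio


nodes = [Node("A", free_slots=2), Node("B", free_slots=2)]
pending = [
    Task("t1", frozenset({"A"})),
    Task("t2", frozenset({"A"})),
    Task("t3", frozenset({"A"})),
    Task("t4", frozenset({"B"})),
]
assignments, local_ratio = assign_tasks(pending, nodes, max_running_per_node=2)
```

In this toy run, three tasks have their input on node A, but the cap keeps A from taking all of them: the third is deferred and runs remotely on B. That is the trade-off between locality and hotspot avoidance that the abstract alludes to.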

Similar resources

S3: An Efficient Shared Scan Scheduler on MapReduce Framework

Hadoop, an open-source implementation of MapReduce, has been widely used for data-intensive computing. In order to improve performance, multiple jobs operating on a common data file can be processed as a batch to share the cost of scanning the file. However, in practice, jobs often do not arrive at the same time, and batching them means longer waiting time for jobs that arrive earlier. In this ...

Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

MapReduce has become a popular programming model for running data-intensive applications on the cloud. Completion time goals or deadlines of MapReduce jobs set by users are becoming crucial in existing cloud-based data processing environments like Hadoop. There is a conflict between scheduling MR jobs to meet deadlines and “data locality” (assigning tasks to nodes that contain their input da...

Hadoop Map Reduce Job Scheduler Implementation and Analysis in Heterogeneous Environment

Hadoop MapReduce is one of the popular frameworks for big data analytics. A MapReduce cluster is shared among multiple users with heterogeneous workloads. When jobs are submitted to the cluster concurrently, resources are shared among them, so system performance might degrade. The issue here is how to schedule the tasks and allocate resources fairly to all jobs. Hadoop supports different s...

Using Pattern Classification for Task Assignment in MapReduce

MapReduce has become a popular paradigm for large-scale data processing in the cloud. The sheer scale of MapReduce deployments makes task assignment in MapReduce an interesting problem. The scale of MapReduce applications presents a unique opportunity to use data-driven algorithms in resource management. We present a learning-based scheduler that uses pattern classification for utilization oriente...

Simulation and performance evaluation of the Hadoop capacity scheduler

Hadoop task schedulers like Fair Share and Capacity have been specially designed to share hardware resources among multiple organizations. The Capacity Scheduler provides a complex set of parameters to give fine control over resource allocation of a shared MapReduce cluster. Administrators and users often run into performance problems because they do not understand the performance influence of ...

Publication date: 2014